Abstract

Error correction of sequenced reads remains a difficult task, especially in single-cell sequencing projects 
with extremely non-uniform coverage. While existing error correction tools designed for standard (multi-cell) 
sequencing data usually come up short in single-cell sequencing projects, algorithms actually used for 
single-cell error correction have been so far very simplistic. 

We introduce several novel algorithms based on Hamming graphs and Bayesian subclustering in our 
new error correction tool BayesHammer. While BayesHammer was designed for single-cell sequencing, we 
demonstrate that it also improves on existing error correction tools for multi-cell sequencing data while 
working much faster on real-life datasets. We benchmark BayesHammer on both k-mer counts and actual
assembly results with the SPAdes genome assembler. 

Background 

Single-cell sequencing [1, 2] based on the Multiple Displacement Amplification (MDA) technology [1, 3]
allows one to sequence genomes of important uncultivated bacteria that until recently had been viewed
as unamenable to genome sequencing. Existing metagenomic approaches (aimed at genes rather than
genomes) are clearly limited for studies of such bacteria despite the fact that they represent the majority of
species in such important studies as the Human Microbiome Project [4, 5] or the discovery of new
antibiotics-producing bacteria [6].

Single-cell sequencing datasets have extremely non-uniform coverage that may vary from ones to thousands
along a single genome (Fig. 1). For many existing error correction tools, most notably Quake [7],
uniform coverage is a prerequisite: in the case of non-uniform coverage they either do not work or produce
poor results.

Error correction tools usually attempt to correct the set of k-character substrings of reads, called k-mers,
and then propagate the corrections to whole reads, which many assemblers require. Error
correction tools often employ the simple idea of discarding rare k-mers, which obviously does not work in
the case of non-uniform coverage.

Medvedev et al. [8] recently presented a new approach to error correction for datasets with non-uniform
coverage. Their algorithm Hammer makes use of the Hamming graph (hence the name) on k-mers (vertices
of the graph correspond to k-mers and edges connect pairs of k-mers with Hamming distance not exceeding
a certain threshold). Hammer employs a simple and fast clustering technique based on selecting a central
k-mer in each connected component of the Hamming graph. Such central k-mers are assumed to be error-free
(i.e., they are assumed to actually appear in the genome), while the other k-mers from connected components
are assumed to be erroneous instances of the corresponding central k-mers. However, Hammer may be
overly simplistic: in connected components of large diameter or connected components with several k-mers
of large multiplicities, it is more reasonable to assume that there are two or more central k-mers (rather than
one as in Hammer). Biologically, such connected components may correspond to either (1) repeated regions
with similar but not identical genomic sequences (repeats), which would be bundled together by existing
error correction tools (including Hammer); or (2) artificially united k-mers from distinct parts of the genome
that just happen to be connected by a path in the Hamming graph (characteristic of Hammer).

In this paper, we introduce the BayesHammer error correction tool that does not rely on uniform coverage.
BayesHammer uses the clustering algorithm of Hammer as a first step and then refines the constructed
clusters by further subclustering them with a procedure that takes into account read quality values (e.g.,
provided by Illumina sequencing machines) and introduces Bayesian (BIC) penalties for extra subclustering
parameters. BayesHammer subclustering aims to capture the complex structure of repeats (possibly of
varying coverage) in the genome by separating even very similar k-mers that come from different instances
of a repeat. BayesHammer also uses a new approach for propagating corrections in k-mers to corrections
in the reads. All algorithms in BayesHammer are heavily parallelized whenever possible; as a result,
BayesHammer gains a significant speedup with more processing cores available. These features make
BayesHammer a perfect error correction tool for single-cell sequencing.

Figure 1: Logarithmic coverage plot for the single-cell E. coli dataset (a similar plot is also given in [2]).

We remark that Hammer produces only a set of central k-mers but does not correct reads, making it
incompatible with most genome assemblers. Quake does correct reads but has severe memory limitations
for large k and assumes uniform coverage. In contrast, EULER-SR [9] and Camel [2] correct reads and do not
make strong assumptions on coverage (both tools have been used for single-cell assembly projects [2]), which
makes these tools suitable for comparison to BayesHammer. Our benchmarks show that BayesHammer
outperforms these tools in both single-cell and standard (multi-cell) modes. We further couple BayesHammer
with a recently developed genome assembler SPAdes [10] and demonstrate that assembly of
BayesHammer-corrected reads significantly improves upon assembly with reads corrected by other tools for the same
datasets, while the total running time also improves significantly.

BayesHammer is freely available for download as part of the SPAdes genome assembler at
http://bioinf.spbau.ru/spades/



Methods 
Notation and outline 

Let $\Sigma = \{A, C, G, T\}$ be the alphabet of nucleotides (BayesHammer discards k-mers with uncertain bases
denoted N). A k-mer is an element of $\Sigma^k$, i.e., a string of k nucleotides. We denote the i-th letter (nucleotide)
of a k-mer x by $x[i]$, indexing them from zero: $0 \le i \le k-1$. A subsequence of x corresponding to a set of
indices I is denoted by $x[I]$. We use interval notation $[i, j]$ for intervals of integers $\{i, i+1, \ldots, j\}$ and further
abbreviate $x[i, j] = x[\{i, i+1, \ldots, j\}]$; thus, $x = x[0, k-1]$. Input reads are represented as a set of strings
$R \subset \Sigma^*$ along with their quality values $(q_r[i])_{i=0}^{|r|-1}$ for each $r \in R$. We assume that $q_r[i]$ estimates the probability
that there has been an error in position i of read r. Notice that in practice, the fastq file format [11] contains
characters that encode probabilities on a logarithmic scale (in particular, products of probabilities used
below correspond to sums of actual quality values).
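To make the remark about the logarithmic scale concrete, here is a minimal sketch that decodes quality characters and checks that a product of error probabilities corresponds to a sum of quality values. It assumes the common Phred convention $Q = -10 \log_{10} p$ with an ASCII offset of 33; the offset and the helper names `phred_scores` and `error_probability` are illustrative assumptions, not part of BayesHammer.

```python
import math

def phred_scores(quality_string, offset=33):
    """Decode FASTQ quality characters into integer Phred scores.

    The offset of 33 is an assumption; some older files use 64.
    """
    return [ord(ch) - offset for ch in quality_string]

def error_probability(score):
    """Convert a Phred score Q into an error probability p = 10^(-Q/10)."""
    return 10 ** (-score / 10)

# Three quality characters with ASCII codes 73, 53, and 43.
scores = phred_scores("I5+")

# Multiplying the per-base error probabilities ...
product = math.prod(error_probability(q) for q in scores)

# ... agrees with converting the summed quality values in one step.
via_sum = error_probability(sum(scores))
```

Working with sums of quality values rather than products of probabilities also avoids floating-point underflow when many occurrences of a k-mer are combined.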

Below we give an overview of the BayesHammer workflow (Fig. 2) and refer to subsequent sections
for further details. On Step (1), k-mers in the reads are counted, producing a triple
$\mathrm{statistics}(x) = (\mathrm{count}_x, \mathrm{quality}_x, \mathrm{error}_x)$ for each k-mer x. Here, $\mathrm{count}_x$ is the number of times x appears as a substring
in the reads, $\mathrm{quality}_x$ is its total quality expressed as a probability of sequencing error in x, and $\mathrm{error}_x$ is a
k-dimensional vector that contains products of error probabilities (sums of quality values) for individual
nucleotides of x across all its occurrences in the reads. On Step (2), we find connected components of the
Hamming graph constructed from this set of k-mers. On Step (3), the connected components become subject
to Bayesian subclustering; as a result, for each k-mer we know the center of its subcluster. On Step (4), we
filter subcluster centers according to their total quality and form a set of solid k-mers, which is then iteratively
expanded on Step (5) by mapping them back to the reads. Step (6) deals with reads correction by counting
the majority vote of solid k-mers in each read. In the iterative version, if there has been a substantial amount
of changes in the reads, we run the next iteration of error correction; otherwise, we output the corrected reads.
Below we describe specific algorithms employed in the BayesHammer pipeline.

Figure 2: BayesHammer workflow.

Algorithms 

Step (1): computing k-mer statistics. 

To collect k-mer statistics, we use a straightforward hash map approach [12] that does not require storing
instances of all k-mers in memory (as an excessive amount of RAM might be needed otherwise). For a certain
positive integer N (the number of auxiliary files), we use a hash function $h : \Sigma^k \to \mathbb{Z}_N$ that maps k-mers
over the alphabet $\Sigma$ to integers from 0 to N - 1.

Figure 3: Hamming graphs $HG_1(X)$ and $HG_2(X)$ for X being the set of 4-mers of the reads ACGTGTG, ACATGTG,
ACCTGTC. Blue edges denote Hamming distance 2.



Algorithm 1 Count k-mers

for each k-mer x from the reads R do
    compute $h(x)$ and write x to $\mathrm{File}_{h(x)}$.
for $i \in [0, N-1]$ do
    sort $\mathrm{File}_i$ with respect to the lexicographic order;
    reading $\mathrm{File}_i$ sequentially, compute $\mathrm{statistics}(s)$ for each k-mer s from $\mathrm{File}_i$.
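The bucketing idea of Algorithm 1 can be sketched as follows. This is only an illustration under stated assumptions: it computes just the count component of the statistics, uses `zlib.crc32` as a stand-in for the hash function h, and collects results in an in-memory dictionary for convenience, whereas a real implementation would stream per-file results to disk. The function name `count_kmers` and the parameter `num_files` are hypothetical.

```python
import os
import tempfile
import zlib
from itertools import groupby

def count_kmers(reads, k, num_files=4):
    """Count k-mers by spilling them into num_files auxiliary files."""
    counts = {}
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}.txt") for i in range(num_files)]
        files = [open(path, "w") for path in paths]
        try:
            for read in reads:
                for i in range(len(read) - k + 1):
                    kmer = read[i:i + k]
                    if "N" in kmer:
                        continue  # k-mers with uncertain bases are discarded
                    # Deterministic hash into [0, num_files - 1].
                    bucket = zlib.crc32(kmer.encode()) % num_files
                    files[bucket].write(kmer + "\n")
        finally:
            for f in files:
                f.close()
        # Only one auxiliary file is loaded and sorted at a time.
        for path in paths:
            with open(path) as f:
                kmers = sorted(line.strip() for line in f)
            # After sorting, equal k-mers are adjacent and can be counted in one pass.
            for kmer, group in groupby(kmers):
                counts[kmer] = sum(1 for _ in group)
    return counts

counts = count_kmers(["ACGTGTG", "ACATGTG", "ACCTGTC"], k=4)
```

Because identical k-mers always hash to the same file, each file can be processed independently, which is also what makes this step easy to parallelize.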



Step (2): Constructing connected components of Hamming graph. 

Step (2) is the essence of the Hammer approach [8]. The Hamming distance between k-mers $x, y \in \Sigma^k$ is the
number of nucleotides in which they differ:

$$d(x, y) = |\{i \in [0, k-1] : x[i] \ne y[i]\}|.$$

For a set of k-mers X, the Hamming graph $HG_\tau(X)$ is an undirected graph with the set of vertices X and edges
corresponding to pairs of k-mers from X with Hamming distance at most $\tau$, i.e., $x, y \in X$ are connected by
an edge in $HG_\tau(X)$ iff $d(x, y) \le \tau$ (Fig. 3). To construct $HG_\tau(X)$ efficiently, we notice that if two k-mers are at
Hamming distance at most $\tau$, and we partition the set of indices $[0, k-1]$ into $\tau + 1$ parts, then at least one
part corresponds to the same subsequence in both k-mers. Below we assume with little loss of generality
that $\tau + 1$ divides k, i.e., $k = \sigma \cdot (\tau + 1)$ for some integer $\sigma$.
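The pigeonhole observation above can be checked on a small example. The sketch below is illustrative only; `tau` stands for the distance threshold, and `contiguous_parts` and `shared_parts` are hypothetical helper names.

```python
def hamming(x, y):
    """Number of positions in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))

def contiguous_parts(k, tau):
    """Split indices 0..k-1 into tau + 1 contiguous parts (assumes tau + 1 divides k)."""
    sigma = k // (tau + 1)
    return [range(s * sigma, (s + 1) * sigma) for s in range(tau + 1)]

def shared_parts(x, y, parts):
    """Indices of the parts on which x and y coincide exactly."""
    return [s for s, part in enumerate(parts) if all(x[i] == y[i] for i in part)]

# Two 9-mers that differ in positions 4 and 8.
x, y = "ACGTGTGCA", "ACGTATGCC"
parts = contiguous_parts(9, 2)
distance = hamming(x, y)
shared = shared_parts(x, y, parts)
```

With two mismatches spread over three parts, at least one part must be mismatch-free; here it is the first one, so the two k-mers land in the same block when grouped by their first three letters.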

For a subset of indices $I \subseteq [0, k-1]$, we define a partial lexicographic ordering $<_I$ as follows: $x <_I y$ iff
$x[I] < y[I]$, where $<$ is the lexicographic ordering on $\Sigma^*$. Similarly, we define a partial equality $=_I$ such that
$x =_I y$ iff $x[I] = y[I]$. We partition the set of indices $[0, k-1]$ into $\tau + 1$ parts of size $\sigma$ and for each part I, sort
a separate copy of X with respect to $<_I$. As noticed above, for every two k-mers $x, y \in X$ with $d(x, y) \le \tau$,
there exists a part I such that $x =_I y$. It therefore suffices to separately consider blocks of equivalent k-mers
with respect to $=_I$ for each part I. If a block is small (i.e., of size smaller than a certain threshold), we go
over the pairs of k-mers in this block to find those with Hamming distance at most $\tau$. If a block is large, we
recursively apply to it the same procedure with a different partition of the indices. In practice, we use two
different partitions of $[0, k-1]$: the first corresponds to contiguous subsets of indices (recall that $\sigma = \frac{k}{\tau+1}$):

$$I_s^{cnt} = \{s\sigma, s\sigma + 1, \ldots, s\sigma + \sigma - 1\}, \quad s = 0, \ldots, \tau,$$

while the second corresponds to strided subsets of indices:

$$I_s^{str} = \{s, s + \tau + 1, s + 2(\tau + 1), \ldots, s + (\sigma - 1)(\tau + 1)\}, \quad s = 0, \ldots, \tau.$$

BayesHammer uses a two-step procedure, first splitting with respect to $\{I_s^{cnt}\}_{s=0}^{\tau}$ (Fig. 4) and then, if an
equivalence block is large, with respect to $\{I_s^{str}\}_{s=0}^{\tau}$. On the block processing step, we use the disjoint set data
structure [12] to maintain the set of connected components. Step (2) is summarized in Algorithm 2.

Algorithm 2 Hamming graph processing

procedure HGProcess(X, max_quadratic)
    Init components with singletons $\mathcal{X} = \{\{x\} : x \in X\}$.
    for all $Y \in$ FindBlocks(X, $\{I_s^{cnt}\}_{s=0}^{\tau}$) do
        if |Y| > max_quadratic then
            for all $Z \in$ FindBlocks(Y, $\{I_s^{str}\}_{s=0}^{\tau}$) do
                ProcessExhaustively(Z, $\mathcal{X}$)
        else
            ProcessExhaustively(Y, $\mathcal{X}$).

function FindBlocks(X, $\{I_s\}_{s=0}^{\tau}$)
    for $s = 0, \ldots, \tau$ do
        sort a copy of X with respect to $<_{I_s}$, getting $X_s$
    for $s = 0, \ldots, \tau$ do
        output the set of equiv. blocks {Y} w.r.t. $=_{I_s}$.

procedure ProcessExhaustively(Y, $\mathcal{X}$)
    for each pair $x, y \in Y$ do
        if $d(x, y) \le \tau$ then join their sets in $\mathcal{X}$:
            for all $x \in Z_x \in \mathcal{X}$, $y \in Z_y \in \mathcal{X}$ do
                $\mathcal{X} := \mathcal{X} \cup \{Z_x \cup Z_y\} \setminus \{Z_x, Z_y\}$.

Figure 4: Partial lexicographic orderings of a set X of 9-mers with respect to the index sets $I_0^{cnt} = \{0, 1, 2\}$,
$I_1^{cnt} = \{3, 4, 5\}$, and $I_2^{cnt} = \{6, 7, 8\}$. Red dotted lines indicate equivalence blocks.
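The two-level block processing with a disjoint set structure can be sketched as below. This is a simplified illustration, not the BayesHammer implementation: blocks are found by grouping k-mers in a dictionary keyed by the selected letters instead of sorting separate copies, and the names `DisjointSets`, `find_blocks`, and `hamming_components` are hypothetical. The sketch assumes a non-empty input and that the threshold plus one divides k.

```python
from collections import defaultdict
from itertools import combinations

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

class DisjointSets:
    """Minimal union-find with path halving."""
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[rx] = ry

def find_blocks(kmers, parts):
    """Yield, for every index part, the blocks of k-mers that agree on that part."""
    for part in parts:
        blocks = defaultdict(list)
        for x in kmers:
            blocks[tuple(x[i] for i in part)].append(x)
        yield from blocks.values()

def hamming_components(kmers, tau, max_quadratic=1000):
    kmers = list(set(kmers))
    k = len(kmers[0])
    sigma = k // (tau + 1)
    contiguous = [range(s * sigma, (s + 1) * sigma) for s in range(tau + 1)]
    strided = [range(s, k, tau + 1) for s in range(tau + 1)]
    sets = DisjointSets(kmers)

    def process_exhaustively(block):
        for x, y in combinations(block, 2):
            if hamming(x, y) <= tau:
                sets.union(x, y)

    for block in find_blocks(kmers, contiguous):
        if len(block) > max_quadratic:
            # Large block: split it again with the strided partition.
            for sub_block in find_blocks(block, strided):
                process_exhaustively(sub_block)
        else:
            process_exhaustively(block)

    components = defaultdict(set)
    for x in kmers:
        components[sets.find(x)].add(x)
    return list(components.values())

small = hamming_components(["AAAA", "AAAT", "AATT", "GGGG", "GGGC"], tau=1)
forced = hamming_components(["AAAA", "AAAT", "AATT", "GGGG", "GGGC"], tau=1,
                            max_quadratic=0)
```

Setting `max_quadratic=0` forces every block through the strided split; the components stay the same because any pair within the threshold must also agree on at least one strided part.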

Step (3): Bayesian subclustering. 

In Hammer's generative model [8], it is assumed that errors in each position of a k-mer are independent and
occur with the same probability $\epsilon$, which is a fixed global parameter (Hammer used $\epsilon = 0.01$). Thus, the
likelihood that a k-mer x was generated from a k-mer y under Hammer's model equals

$$L_{\mathrm{HAMMER}}(x \mid y) = (1 - \epsilon)^{k - d(x, y)} \epsilon^{d(x, y)}.$$

Under this model, the maximum likelihood center of a cluster is simply its consensus string [8].

In BayesHammer, we further elaborate upon Hammer's model. Instead of a fixed $\epsilon$, we use read quality
values that approximate probabilities $q_x[i]$ of a nucleotide at position i in the k-mer x being erroneous. We
combine quality values from identical k-mers in the reads: for a multiset of k-mers X that agree on the j-th
nucleotide, it is erroneous with probability $\prod_{x \in X} q_x[j]$.

The likelihood that a k-mer x has been generated from another k-mer c (under the independent errors
assumption) is given by

$$L(x \mid c) = \prod_{j:\, x[j] \ne c[j]} q_x[j] \prod_{j:\, x[j] = c[j]} (1 - q_x[j]),$$

and the likelihood of a specific subclustering $C = C_1 \cup \ldots \cup C_m$ is

$$L_m(C_1, \ldots, C_m) = \prod_{i=1}^{m} \prod_{x \in C_i} L(x \mid c_i),$$

where $c_i$ is the center (consensus string) of the subcluster $C_i$.
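The likelihood formulas above are easiest to evaluate in log space. The following sketch assumes that every error probability lies strictly between 0 and 1 and takes the center to be the unweighted per-position majority string; the weighting is an assumption, since the text only says "consensus string". The function names are hypothetical.

```python
import math

def log_likelihood(x, q, c):
    """log L(x | c) for a k-mer x with per-position error probabilities q and center c."""
    total = 0.0
    for xj, qj, cj in zip(x, q, c):
        # A mismatch with the center contributes q, a match contributes 1 - q.
        total += math.log(qj) if xj != cj else math.log1p(-qj)
    return total

def consensus(kmers):
    """Unweighted per-position majority string (ties go to the first of A, C, G, T)."""
    return "".join(
        max("ACGT", key=lambda a: sum(x[j] == a for x in kmers))
        for j in range(len(kmers[0]))
    )

def log_likelihood_subclustering(subclusters):
    """log L_m for a list of subclusters, each a list of (kmer, q) pairs."""
    total = 0.0
    for cluster in subclusters:
        center = consensus([x for x, _ in cluster])
        total += sum(log_likelihood(x, q, center) for x, q in cluster)
    return total

single = log_likelihood("ACGT", [0.01] * 4, "ACGA")
cluster = [("AAAA", [0.1] * 4), ("AAAT", [0.1] * 4), ("AAAA", [0.1] * 4)]
whole = log_likelihood_subclustering([cluster])
```

In the example, `single` has three matching positions and one mismatch, and `whole` accumulates eleven matches and one mismatch against the center AAAA.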

In the subclustering procedure (see Algorithm 3), we sequentially subcluster each connected component
of the Hamming graph into more and more clusters with the classical k-means clustering algorithm (denoted
m-means since k has a different meaning). For the objective function, we use the likelihood as above penalized
for overfitting with the Bayesian information criterion (BIC) [13]. In this case, there are |C| observations in
the dataset, and the total number of parameters is 3km + m - 1:

• m - 1 for probabilities of subclusters,

• km for cluster centers, and

• 2km for error probabilities in each letter: there are 3 possible errors for each letter, and the probabilities
should sum up to one. Here error probabilities are conditioned on the fact that an error has occurred
(alternatively, we could consider the entire distribution, including the correct letter, and get 3km
parameters for probabilities, but then there would be no need to specify cluster centers, so the total
number is the same).

Therefore, the resulting objective function is

$$\ell_m := 2 \cdot \log L_m(C_1, \ldots, C_m) - (3km + m - 1) \cdot \log |C|$$

for subclustering into m clusters; we stop as soon as $\ell_m$ ceases to increase.

Algorithm 3 Bayesian subclustering

for all connected components C of the Hamming graph do
    m := 1
    $\ell_1 := 2 \log L_1(C)$ (likelihood of the cluster generated by the consensus)
    repeat
        m := m + 1
        do m-means clustering of $C = C_1 \cup \ldots \cup C_m$ w.r.t. the Hamming distance; the initial approximation
        to the centers is given by k-mers that have the least error probability
        $\ell_m := 2 \cdot \log L_m(C_1, \ldots, C_m) - (3km + m - 1) \cdot \log |C|$
    until $\ell_m \le \ell_{m-1}$
    output the best found clustering $C = C_1 \cup \ldots \cup C_{m-1}$
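The subclustering loop can be sketched as follows. This is a toy version under several assumptions: items are `(kmer, q)` pairs with all error probabilities strictly between 0 and 1, centers are unweighted consensus strings, the initial centers are the distinct k-mers with the smallest summed error probability, and the score for a single cluster is left unpenalized as in Algorithm 3. All function names are hypothetical.

```python
import math

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def consensus(kmers):
    """Unweighted per-position majority string (ties go to the first of A, C, G, T)."""
    return "".join(
        max("ACGT", key=lambda a: sum(x[j] == a for x in kmers))
        for j in range(len(kmers[0]))
    )

def total_log_lik(clusters):
    """log L_m: sum of log L(x | c_i) over all items of all subclusters."""
    total = 0.0
    for cluster in clusters:
        center = consensus([x for x, _ in cluster])
        for x, q in cluster:
            for xj, qj, cj in zip(x, q, center):
                total += math.log(qj) if xj != cj else math.log1p(-qj)
    return total

def m_means(items, m, rounds=10):
    """Lloyd-style clustering under the Hamming distance; None if m centers cannot be seeded."""
    centers = []
    for x, _ in sorted(items, key=lambda item: sum(item[1])):
        if x not in centers:
            centers.append(x)
        if len(centers) == m:
            break
    if len(centers) < m:
        return None
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for x, q in items:
            nearest = min(range(len(centers)), key=lambda i: hamming(x, centers[i]))
            clusters[nearest].append((x, q))
        clusters = [c for c in clusters if c]
        updated = [consensus([x for x, _ in c]) for c in clusters]
        if updated == centers:
            break
        centers = updated
    return clusters

def bayesian_subclustering(items):
    k, n = len(items[0][0]), len(items)
    best = [list(items)]
    best_score = 2 * total_log_lik(best)  # score for m = 1, unpenalized
    m = 1
    while True:
        m += 1
        clusters = m_means(items, m)
        if clusters is None or len(clusters) < m:
            break
        score = 2 * total_log_lik(clusters) - (3 * k * m + m - 1) * math.log(n)
        if score <= best_score:
            break  # the penalized likelihood ceased to increase
        best, best_score = clusters, score
    return best

two_repeats = [("AAAA", [0.001] * 4)] * 5 + [("TTTT", [0.001] * 4)] * 5
one_error = [("AAAA", [0.001] * 4)] * 9 + [("AAAT", [0.001, 0.001, 0.001, 0.3])]
split = bayesian_subclustering(two_repeats)
kept = bayesian_subclustering(one_error)
```

In the first example two confidently read k-mer families are separated, because explaining TTTT as four high-quality errors is far more costly than the BIC penalty. In the second, a single low-quality variant does not justify the extra parameters, so the component stays in one cluster.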

Steps (4) and (5): selecting solid k-mers and expanding the set of solid k-mers. 

We define the quality of a k-mer x as the probability that it is error-free: $p_x = \prod_{j=0}^{k-1} (1 - q_x[j])$. The k-mer
qualities are computed on Step (1) along with computing k-mer statistics. Next, we (generously) define the
quality of a cluster C as the probability that at least one k-mer in C is correct:

$$p_C = 1 - \prod_{x \in C} (1 - p_x).$$

In contrast to Hammer, we do not distinguish whether the cluster is a singleton (i.e., |C| = 1); there may be
plenty of superfluous clusters with several k-mers obtained by chance (actually, it is more likely to obtain a
cluster of several k-mers by chance than a singleton of the same total multiplicity).
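Both quality definitions translate directly into code. The sketch below is a plain transcription of the two formulas; the function names are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
import math

def kmer_quality(q):
    """Probability that a k-mer with per-position error probabilities q is error-free."""
    return math.prod(1.0 - qj for qj in q)

def cluster_quality(kmer_qualities):
    """Probability that at least one k-mer in the cluster is correct."""
    return 1.0 - math.prod(1.0 - p for p in kmer_qualities)

p_x = kmer_quality([0.1, 0.2])
p_c = cluster_quality([0.5, 0.5])
```

Note how generous the cluster quality is: two k-mers that are each only 50% likely to be correct already give a cluster quality of 0.75.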

Initially we mark as solid the centers of the clusters whose total quality exceeds a predefined threshold
(a global parameter for BayesHammer, set to be rather strict). Then we expand the set of solid k-mers
iteratively: if a read is completely covered by solid k-mers, we conclude that it actually comes from the
genome and mark all other k-mers in this read as solid, too (Algorithm 4).

Algorithm 4 Solid k-mers expansion

procedure IterativeExpansion(R, X)
    while ExpansionStep(R, X) do

function ExpansionStep(R, X)
    for all reads $r \in R$ do
        if r is completely covered by solid k-mers then
            mark all k-mers in r as solid
    Return true if X has increased and false otherwise.

Algorithm 5 Reads correction

Input: reads R, solid k-mers X, clusters $\mathcal{C}$.
for all reads $r \in R$ do
    init consensus array $v : [0, |r| - 1] \times \{A, C, G, T\} \to \mathbb{N}$ with zeros: $v(j, a) := 0$ for all $j = 0, \ldots, |r| - 1$ and
    $a \in \Sigma$
    for $i = 0, \ldots, |r| - k$ do
        if $r[i, i + k - 1] \in X$ (it is solid) then
            for $j \in [i, i + k - 1]$ do
                $v(j, r[j]) := v(j, r[j]) + 1$
        if $r[i, i + k - 1] \in C$ for some $C \in \mathcal{C}$ then
            let x be the center of C
            if $x \in X$ (the k-mer belongs to a cluster with a solid center) then
                for $j \in [i, i + k - 1]$ do
                    $v(j, x[j - i]) := v(j, x[j - i]) + 1$
    for $i \in [0, |r| - 1]$ do
        $r[i] := \arg\max_{a \in \Sigma} v(i, a)$.

Step (6): reads correction. 

After Steps (1)-(5), we have constructed the set of solid k-mers that are presumably error-free. To construct
corrected reads from the set of solid k-mers, for each base of every read, we compute the consensus of all
solid k-mers and solid centers of clusters of all non-solid k-mers covering this base (Fig. 5). This step is
formally described as Algorithm 5.
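The expansion and correction steps can be sketched together as follows. This follows the prose description rather than any particular implementation: a read counts as "completely covered" when every position lies inside at least one solid k-mer, a solid k-mer votes for itself, and a non-solid k-mer votes through its cluster center if that center is solid. The mapping `center_of` from a k-mer to its cluster center and all function names are assumptions made for illustration; positions that receive no votes are left unchanged.

```python
from collections import Counter

def kmers_of(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def expand_solid(reads, solid, k):
    """Iteratively add all k-mers of reads that are fully covered by solid k-mers."""
    solid = set(solid)
    changed = True
    while changed:
        changed = False
        for read in reads:
            kms = kmers_of(read, k)
            covered = [False] * len(read)
            for i, kmer in enumerate(kms):
                if kmer in solid:
                    covered[i:i + k] = [True] * k
            if kms and all(covered) and not solid.issuperset(kms):
                solid.update(kms)
                changed = True
    return solid

def correct_read(read, solid, center_of, k):
    """Majority vote per base over solid k-mers and solid cluster centers."""
    votes = [Counter() for _ in read]
    for i, kmer in enumerate(kmers_of(read, k)):
        if kmer in solid:
            source = kmer
        else:
            source = center_of.get(kmer)
            if source is None or source not in solid:
                continue  # non-solid k-mer without a solid center casts no vote
        for j in range(k):
            votes[i + j][source[j]] += 1
    return "".join(
        v.most_common(1)[0][0] if v else base for v, base in zip(votes, read)
    )

expanded = expand_solid(["ACGTA"], {"ACG", "GTA"}, k=3)
centers = {"ACC": "ACG", "CCT": "CGT", "CTA": "GTA"}
fixed = correct_read("ACCTA", expanded, centers, k=3)
```

In the example, the read ACGTA is fully covered by the two initial solid 3-mers, so its middle 3-mer CGT becomes solid as well; the erroneous read ACCTA is then corrected at its third base by the votes of the three solid centers.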

Results and discussion 
Datasets 

In our experiments, we used three datasets from [2]: a single-cell E. coli, a single-cell S. aureus, and a
standard (multi-cell) E. coli dataset. Paired-end libraries were generated by an Illumina Genome Analyzer
IIx from MDA-amplified single-cell DNA and from multi-cell genomic DNA prepared from cultured E. coli,
respectively. These datasets consist of 100 bp paired-end reads with insert size 220; both E. coli datasets have
average coverage ~600x, although the coverage is highly non-uniform in the single-cell case.

Figure 5: Reads correction. Grey k-mers indicate non-solid k-mers. Red k-mers are the centers of the
corresponding clusters (two grey k-mers struck through on the right are non-solid singletons). As a result,
one nucleotide is changed.

In all experiments, BayesHammer used k = 21 (we observed no improvements for higher values of k). 

k-mer counts

Table 1 shows error correction statistics produced by different tools on all three datasets. For a comparison
with Hammer, we have emulated Hammer with read correction by turning off Bayesian subclustering
(HammerExpanded in the table) and both Bayesian subclustering and read expansion, another new idea of
BayesHammer (HammerNoExpansion in the table). Note that despite its more complex processing, BayesHammer
is significantly faster than other error correction tools (except, of course, for Hammer, which is a strict
subset of BayesHammer processing in our experiments and is run on BayesHammer code). BayesHammer
also produces, in the single-cell case, a much smaller set of k-mers in the resulting reads, which leads to
smaller de Bruijn graphs and thus reduces the total assembly running time. Since BayesHammer trims only
bad quality bases and does not, like Quake, trim bases that it has not been able to correct (such trimming
proved detrimental for single-cell assembly in our experiments), it does produce a much larger set of k-mers than
Quake on a multi-cell dataset.

For a comparison of BayesHammer with other tools in terms of error rate reduction across an average
read, see the logarithmic error rate graphs in Fig. 6. Note that we are able to count errors only for the
reads that actually aligned to the genome, so the graphs are biased in this way. Note how the first 21 bases
are corrected better than others in BayesHammer and both versions of Hammer since we have run it with
k = 21; still, other values of k did not show a significant improvement in either k-mer statistics or, more
importantly, assembly results.

Table 1: k-mer statistics.
Single-cell E. coli, total 4,543,849 genomic k-mers
Assembly results 

Tables 2 and 3 show assembly results by the recently developed SPAdes assembler [10]; SPAdes was
designed specifically for single-cell assembly, but has by now demonstrated state-of-the-art results on
multi-cell datasets as well.

In the tables, N50 is such a length that contigs of that length or longer comprise at least half of the assembly; NG50
is a metric similar to N50 but only taking into account contigs comprising (and aligning to) the reference
genome; NA50 is a metric similar to N50 after breaking up misassembled contigs by their misassemblies.
NGx and NAx metrics have a more direct relevance to assembly quality than regular Nx metrics; our result
tables have been produced by the recently developed tool QUAST [14].

All assemblies have been done with SPAdes. The results show that after BayesHammer correction,
assembly results improve significantly, especially in the single-cell E. coli case; it is especially interesting to
note that even in the multi-cell case, where BayesHammer loses to Quake by k-mer statistics, assembly results
actually improve over assemblies produced from Quake-corrected reads (including genome coverage and
the number of genes).

[Figure 6 plots: three panels with read positions (10 to 80) on the horizontal axis. Panels "single-cell E. coli dataset reads" and "single-cell S. aureus dataset reads" show the curves Uncorrected, Coral, Camel, EulerSR, Hammer (no expansion), Hammer (expanded), and BayesHammer; the third panel shows Uncorrected, Quake, Hammer (no expansion), Hammer (expanded), and BayesHammer.]

Figure 6: Error reduction by read position on logarithmic scale for the single-cell E. coli, single-cell S. aureus,
and multi-cell E. coli datasets.
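As an illustration of the N50 definition given above, here is a minimal sketch. The function name `nx` and its parameters are hypothetical; passing the reference genome length as `total` gives a reference-based variant in the spirit of NG50, which is an assumption about how one might reuse the same routine rather than a description of how QUAST is implemented.

```python
def nx(contig_lengths, fraction=0.5, total=None):
    """Largest length L such that contigs of length >= L cover at least
    the given fraction of the total (by default, the assembly length)."""
    if total is None:
        total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return None  # the contigs do not reach the requested fraction

lengths = [80, 70, 50, 40, 30, 20]
n50 = nx(lengths)                      # half of the 290 bp assembly
ng50_like = nx(lengths, total=400)     # half of a hypothetical 400 bp reference
```

The two values differ because the second call measures the same contigs against a longer hypothetical reference, so more contigs are needed to reach the halfway point.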

Conclusions 

Single-cell sequencing presents novel challenges to error correction tools. In contrast to multi-cell datasets,
single-cell datasets have no well-behaved distribution of k-mer multiplicities; one therefore has to work with
k-mers on a one-by-one basis, considering each cluster of k-mers separately. In this work, we further developed
the ideas of Hammer from a Bayesian clustering perspective and presented a new tool, BayesHammer, that
makes them practical and yields significant improvements over existing error correction tools.

There is further work to be done to make our underlying models closer to real life; for instance, one could
learn a non-uniform distribution of single nucleotide errors and plug it into our likelihood formulas. Another
natural improvement would be to try to rid the results of contamination by either human or some other
DNA material; we observed significant human DNA contamination in our single-cell dataset, so weeding
it out might yield a significant improvement. Finally, a new general approach that we are going to try in
our further work deals with the technique of minimizers introduced by Roberts et al. [15]. It may provide
a significant reduction in memory requirements and a possible approach to dealing with paired information.

Table 2: Assembly results, single-cell E. coli and S. aureus datasets (contigs of length > 200 are used).
[Column headers are not recoverable from the source; each row lists ten values in column order.]

Single-cell E. coli, reference length 4639675, reference GC content 50.79%
# contigs (> 1000 bp)     191       158       276       224       231       150       195       282       242       173
# contigs                 521       462       675       592       578       375       529       655       592       477
Largest contig            269177    284968    179022    179022    267676    267676    268464    210850    210850    268464
Total length              4952297   4989404   5064570   5122860   4817757   4902434   4977294   5097148   5340871   5005022
N50                       110539    113056    45672     67849     74139     95704     97639     65415     84893     109826
NG50                      112065    118432    55073     87317     77762     108976    101871    68595     96600     112161
NA50                      110539    113056    45672     67765     74139     95704     97639     65415     84841     109826
NGA50                     112064    118432    55073     87317     77762     108976    101871    68594     96361     112161
# misassemblies           4         6         9         12        6         8         4         4         7         7
# misassembled contigs    4         6         9         10        6         8         4         4         7         7
Misass. contigs length    42496     94172     62114     150232    47372     149639    43304     26872     147140    130706
Genome covered (%)        96.320    96.315    96.623    96.646    95.337    95.231    96.287    96.247    96.228    96.281
GC (%)                    49.70     49.69     49.61     49.56     49.90     49.74     49.68     49.64     49.60     49.68
# mismatches per 100 kbp  11.22     11.70     8.36      9.10      5.55      5.82      12.77     54.11     52.48     13.08
# indels per 100 kbp      1.07      8.26      9.17      12.76     0.52      47.80     0.91      1.17      7.96      8.69
# genes                   4065 +    4079 +    3998 +    4040 +    3992 +    4020 +    4068 +    4034 +    4048 +    4078 +
                          124 part  110 part  180 part  143 part  140 part  107 part  123 part  152 part  136 part  111 part

Single-cell S. aureus, reference length 2872769, reference GC content 32.75%
# contigs (> 1000 bp)     95        85        132       113       82        70        114       272       258       101
Total length (> 1000 bp)  3019597   3309342   3055585   3066662   2972925   2993100   3033912   3389846   3405223   3509555
# contigs                 260       241       455       423       166       134       312       721       711       292
Largest contig            282558    328686    208166    208166    254085    535477    282558    148002    166053    328679
Total length              3081173   3368034   3160497   3166169   3008746   3020256   3111423   3575679   3594468   3584266
N50                       87684     145466    62429     90701     101836    145466    74715     30788     34943     131272
NG50                      112566    194902    87636     99341     108151    159555    88292     39768     45889     180022
NA50                      87684     145466    62429     89365     100509    145466    68711     30788     34552     112801
NGA50                     88246     148064    74452     90101     101836    145466    88289     35998     42642     148023
# misassemblies           15        17        11        14        4         5         11        14        18        14
# misassembled contigs    12        14        9         10        4         5         9         14        16        12
Misass. contigs length    340603    779785    478009    523596    377133    918380    402997    272677    324361    940356
Genome covered (%)        99.522    99.483    99.449    99.447    99.213    99.254    99.204    98.820    98.888    99.221
GC (%)                    32.67     32.63     32.64     32.63     32.66     32.67     32.67     32.39     32.38     32.57
# mismatches per 100 kbp  3.18      8.01      12.44     12.65     9.72      10.28     17.38     54.92     55.50     15.36
# indels per 100 kbp      2.17      2.30      15.50     15.67     3.80      4.08      3.57      2.64      2.72      3.04
# genes                   2540 +    2547 +    2532 +    2540 +    2547 +    2550 +    2535 +    2477 +    2485 +    2539 +
                          36 part   30 part   45 part   37 part   30 part   27 part   41 part   91 part   85 part   38 part
Table 3: Assembly results, multi-cell E. coli dataset (contigs of length > 200 are used).

Multi-cell E. coli, 600x coverage, reference length 4639675, reference GC content 50.79%

Statistics                   BayesHammer   BayesHammer   Hammer,       Hammer,       Hammer,       Hammer        Quake
                                           (scaffold)    expanded      no expansion  no expansion  (scaffold)
                                                                                     (scaffold)
# contigs (> 500 bp)         103           102           119           238           213           115           165
# contigs (> 1000 bp)        91            90            99            192           171           96            156
Total length (> 500 bp)      4641845       4641790       4626515       4730338       4817457       4627067       4543682
Total length (> 1000 bp)     4633361       4633306       4611745       4696966       4787210       4612838       4537565
# contigs                    122           121           146           325           303           141           204
Largest contig               285113        285113        218217        210240        210240        218217        165487
Total length                 4647325       4647270       4635156       4756088       4844208       4635349       4555015
N50                          132645        132645        113608        59167         73113         113608        58777
NG50                         132645        132645        113608        59669         80085         113608        57174
NA50                         132645        132645        113608        59167         73113         113608        58777
GC (%)                       50.78         50.77         50.77         50.73         50.71         50.77         50.75
N's (%)                      0.00000       0.00000       0.00000       0.00000       0.00000       0.00000       0.00000
# mismatches per 100 kbp     8.55          8.55          13.76         44.46         44.33         13.76         1.21

[The following rows are incomplete in the source; surviving values are listed in source order and cannot be assigned to columns.]
NGA50: 132645, 113608, 59669, 57174
(row label missing): 3, 3, 4, 4, 7
(row label missing): 3, 3, 4, 4, 7
Misassembled contigs length: 44466, 44466, 15259
Genome covered (%): 99.440, 99.383, 98.891, 98.925, 99.385, 98.747
# indels per 100 kbp: 0.99, 1.14, 0.76